The objective of this work is to analyze data from universities in the US by applying unsupervised learning techniques such as PCA, FA and clustering. Our intention is to separate observations, i.e. universities, according to their type (public/private) or even according to how good the university is. This would give us a wider view and, hopefully, a better understanding of post-secondary education in the United States.
We consider a dataset available on the website of the Integrated Postsecondary Education Data System (IPEDS). This dataset was also used in the ASA Statistical Graphics Section's 1995 Data Analysis Exposition on USNEWS college data. It contains information on over 1000 American colleges and universities and can be found at the following link: https://nces.ed.gov/ipeds/datacenter/InstitutionByGroup.aspx. The following variables have been considered:
| Variable | Information |
|---|---|
| FICE | Federal ID number |
| X | College name |
| State | Postal code of the state |
| Public/private | Indicator of public=1, private=2 |
| Av_Math_SAT | Average Math SAT score |
| Av_Verbal_SAT | Average Verbal SAT score |
| Av_Comb_SAT | Average Combined SAT score |
| Av_ACT_score | Average ACT score |
| 1Q_Math_SAT | First quartile - Math SAT |
| 3Q_Math_SAT | Third quartile - Math SAT |
| 1Q_Verbal_SAT | First quartile - Verbal SAT |
| 3Q_Verbal_SAT | Third quartile - Verbal SAT |
| 1Q_ACT | First quartile - ACT |
| 3Q_ACT | Third quartile - ACT |
| Apps | Number of applications received |
| Accepted | Number of applicants accepted |
| Enrolled | Number of new students enrolled |
| Top10perc | Pct. new students from top 10% of H.S. class |
| Top25perc | Pct. new students from top 25% of H.S. class |
| Full_Under | Number of fulltime undergraduates |
| Part_Under | Number of parttime undergraduates |
| In-state | In-state tuition |
| Out-of-state | Out-of-state tuition |
| Room_board | Room and board costs |
| Room | Room costs |
| Board | Board costs |
| Fees | Additional fees |
| Book | Estimated book costs |
| Personal | Estimated personal spending |
| Faculty_PhD | Pct. of faculty with Ph.D.’s |
| Faculty_Terminal | Pct. of faculty with terminal degree |
| SF_Ratio | Student/faculty ratio |
| Perc_Donate | Pct.alumni who donate |
| Instructional | Instructional expenditure per student |
| Grad_Rate | Graduation rate |
data = read.csv("collegedata.csv", header = F, sep = ",", na.strings = "*")
Before starting with the models, we need to explore and modify the dataset in order to obtain a clean input.
library(tidyverse)
library(leaflet)
library(rgdal)
library(stringr)
library(htmltab)
library(GGally)
library(factoextra)
library(fastDummies)
library(ggplot2)
library(cluster)
library(VIM)    # aggr(), used below to visualize missing values
library(mice)   # multiple imputation of missing values
library(ggpubr) # ggarrange(), used to combine plots
Once the data are loaded, we rename the variables and discard those that are not of interest for the project:
colnames(data) = c("FICE","X","State","Public_Private","Av_Math_SAT","Av_Verbal_SAT","Av_Comb_SAT","Av_ACT_score","1Q_Math_SAT","3Q_Math_SAT","1Q_Verbal_SAT","3Q_Verbal_SAT","1Q_ACT","3Q_ACT","Apps","Accepted","Enrolled","Top10perc","Top25perc","Full_Under","Part_Under","In_State","Out_State","Room_Board","Room","Board","Fees","Books","Personal","Faculty_PhD","Faculty_Terminal","SF_Ratio","Perc_Donate","Instructional","Grad_Rate")
data = data %>% select(-c(1,5,6,7,8,9,10,11,12,13,14))
Now we take a look at the types of the variables, in case we need to make modifications. All variables have the right type except X and Public_Private, which need to be factors. However, they will be converted later, in the Profiles section.
Next, we discard the Room and Board variables, because the variable Room_Board is already their sum. Moreover, we delete Top10perc, since those students are by definition included in Top25perc. This is done to avoid strong correlations later.
str(data)
## 'data.frame': 1302 obs. of 24 variables:
## $ X : chr "Alaska Pacific University" "University of Alaska at Fairbanks" "University of Alaska Southeast" "University of Alaska at Anchorage" ...
## $ State : chr "AK" "AK" "AK" "AK" ...
## $ Public_Private : int 2 1 1 1 1 2 1 1 1 2 ...
## $ Apps : int 193 1852 146 2065 2817 345 1351 4639 7548 805 ...
## $ Accepted : int 146 1427 117 1598 1920 320 892 3272 6791 588 ...
## $ Enrolled : int 55 928 89 1162 984 179 570 1278 3070 287 ...
## $ Top10perc : int 16 NA 4 NA NA NA 18 NA 25 67 ...
## $ Top25perc : int 44 NA 24 NA NA 27 78 NA 57 88 ...
## $ Full_Under : int 249 3885 492 6209 3958 1367 2385 4051 16262 1376 ...
## $ Part_Under : int 869 4519 1849 10537 305 578 331 405 1716 207 ...
## $ In_State : int 7560 1742 1742 1742 1700 5600 2220 1500 2100 11660 ...
## $ Out_State : int 7560 5226 5226 5226 3400 5600 4440 3000 6300 11660 ...
## $ Room_Board : int 4120 3590 4764 5120 2550 3250 3030 1960 3933 4325 ...
## $ Room : int 1620 1800 2514 2600 1108 1550 NA 1960 NA 2050 ...
## $ Board : int 2500 1790 2250 2520 1442 1700 NA NA NA 2430 ...
## $ Fees : int 130 155 34 114 155 300 124 84 NA 120 ...
## $ Books : int 800 650 500 580 500 350 300 500 600 400 ...
## $ Personal : int 1500 2304 1162 1260 850 NA 600 NA 1908 900 ...
## $ Faculty_PhD : int 76 67 39 48 53 52 72 48 85 74 ...
## $ Faculty_Terminal: int 72 NA 51 NA 53 56 72 53 91 79 ...
## $ SF_Ratio : num 11.9 10 9.5 13.7 14.3 32.8 18.9 18.7 16.7 14 ...
## $ Perc_Donate : int 2 8 NA 6 NA NA 8 NA 18 34 ...
## $ Instructional : int 10922 11935 9584 8046 7043 3971 5883 NA 6642 8649 ...
## $ Grad_Rate : int 15 NA 39 NA 40 55 51 15 69 72 ...
data$Room = NULL
data$Board = NULL
data$Top10perc = NULL
We check whether the data contain missing values (NAs), and where. First, we create a copy of data, full_data, in order to compare some values later.
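Since this copy is referenced later when validating the imputation, it must be created before any rows are deleted or values imputed; it is a single assignment:

```r
# untouched copy of the data, kept for comparing original vs. imputed values
# (R copies data frames on modification, so full_data keeps the old values):
full_data = data
```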
Next, using the VIM package we can visualize the distribution of NAs and the proportion of missing values for each variable. In 4 variables the percentage of missing data is around 10-20%, while in the others it is lower than 5%. For this reason, we can set a threshold (for example the value for Grad_Rate, because from there on the distribution is more stable) and remove the observations that are NA in the variables whose percentage of NAs is higher than that threshold.
# plot of NAs:
na_plot = aggr(data, col=c('#69b3a2','red'), numbers=TRUE, sortVars=TRUE, labels=names(data), cex.axis=.7, gap=3, ylab=c("Histogram of NAs","Pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## Fees 0.210445469
## Perc_Donate 0.170506912
## Top25perc 0.155145929
## Personal 0.139016897
## Grad_Rate 0.075268817
## Room_Board 0.058371736
## Books 0.036866359
## Instructional 0.029953917
## Part_Under 0.024577573
## Faculty_PhD 0.024577573
## In_State 0.023041475
## Faculty_Terminal 0.023041475
## Out_State 0.015360983
## Accepted 0.008448541
## Apps 0.007680492
## Enrolled 0.003840246
## Full_Under 0.002304147
## SF_Ratio 0.001536098
## X 0.000000000
## State 0.000000000
## Public_Private 0.000000000
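The proportions listed above can also be reproduced without VIM; a minimal base-R cross-check:

```r
# proportion of NAs per variable, sorted in decreasing order:
sort(colMeans(is.na(data)), decreasing = TRUE)
```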
These variables are Fees, Perc_Donate, Top25perc and Personal. However, Personal is not treated this way because, once the NA rows of Top25perc and Perc_Donate have been deleted, its proportion of NAs falls below the threshold. Note also that Fees will become part of a new variable, TotalCost, to which its contribution is small. For these reasons, we are going to use MICE to estimate the remaining missing values; if we simply omitted the corresponding observations, we would lose information on many important universities.
# delete NAs of Top25perc and Perc_Donate:
data = data[!is.na(data$Top25perc),]
data = data[!is.na(data$Perc_Donate),]
# new plot of NAs:
na_plot = aggr(data, col=c('#69b3a2','red'), numbers=TRUE, sortVars=TRUE, labels=names(data), cex.axis=.7, gap=3, ylab=c("Histogram of NAs","Pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## Fees 0.188087774
## Personal 0.070010449
## Room_Board 0.036572623
## Grad_Rate 0.034482759
## Part_Under 0.024033438
## In_State 0.018808777
## Faculty_PhD 0.017763845
## Books 0.012539185
## Faculty_Terminal 0.010449321
## Out_State 0.009404389
## Instructional 0.006269592
## Apps 0.004179728
## Accepted 0.002089864
## Enrolled 0.001044932
## X 0.000000000
## State 0.000000000
## Public_Private 0.000000000
## Top25perc 0.000000000
## Full_Under 0.000000000
## SF_Ratio 0.000000000
## Perc_Donate 0.000000000
Before using MICE, we create a new variable, Acc_Rate, which will help eliminate the correlation between Apps and Accepted and will later be used to catalog the universities. We create it now to avoid Acc_Rate being \(>\) 1 due to imputed values of Apps and Accepted. Obviously, we then delete Apps and Accepted.
data$Acc_Rate = data$Accepted/data$Apps
data$Accepted = NULL
data$Apps = NULL
We have different options for handling these NAs. But, as anticipated, we are going to impute them using Multiple Imputation by Chained Equations (MICE), which better preserves sample variability.
# model (random-forest imputation; a fixed seed makes the imputation reproducible):
set.seed(1234)
mice = mice(data, method = 'rf')
mice.com = mice::complete(mice)
# variables that had NAs:
data$Personal = mice.com$Personal
data$Grad_Rate = mice.com$Grad_Rate
data$Room_Board = mice.com$Room_Board
data$Part_Under = mice.com$Part_Under
data$Faculty_PhD = mice.com$Faculty_PhD
data$In_State = mice.com$In_State
data$Out_State = mice.com$Out_State
data$Instructional = mice.com$Instructional
data$Books = mice.com$Books
data$Faculty_Terminal = mice.com$Faculty_Terminal
data$Fees = mice.com$Fees
data$Enrolled = mice.com$Enrolled
data$Acc_Rate = mice.com$Acc_Rate
# plot without NAs
na_plot = aggr(data, col=c('#69b3a2','red'), numbers=TRUE, sortVars=TRUE, labels=names(data), cex.axis=.7, gap=3, ylab=c("Histogram of NAs","Pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## X 0
## State 0
## Public_Private 0
## Enrolled 0
## Top25perc 0
## Full_Under 0
## Part_Under 0
## In_State 0
## Out_State 0
## Room_Board 0
## Fees 0
## Books 0
## Personal 0
## Faculty_PhD 0
## Faculty_Terminal 0
## SF_Ratio 0
## Perc_Donate 0
## Instructional 0
## Grad_Rate 0
## Acc_Rate 0
A natural question to ask is: do the values predicted by MICE make sense? To answer it, we compare the values of Fees (this is why we previously created full_data), the variable with the highest percentage of NAs, with and without MICE.
# fees with MICE:
fees_mice = ggplot() + geom_boxplot(aes(x=data$Fees, color="red")) + theme(legend.position="none")
# fees without MICE:
fees_Nmice = ggplot() + geom_boxplot(aes(x=full_data$Fees)) +
theme(legend.position="none")
# both plots:
comb_plot = ggarrange(fees_mice, fees_Nmice,
labels = c("Values of Fees with MICE", "Values of Fees in original dataset without MICE"),
ncol = 1, nrow = 2); comb_plot
We see that values are similar, so the imputation with MICE is acceptable.
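Beyond the visual check, the two distributions can also be compared numerically, using the untouched full_data copy created earlier:

```r
# side-by-side quantiles of Fees before (NAs removed) and after imputation:
rbind(original = quantile(full_data$Fees, na.rm = TRUE),
      imputed  = quantile(data$Fees))
```

Similar quartiles in both rows support the conclusion above.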
Let's get a wider look at the data by using the summary command.
summary(data)
## X State Public_Private Enrolled
## Length:957 Length:957 Min. :1.000 Min. : 32.0
## Class :character Class :character 1st Qu.:1.000 1st Qu.: 238.0
## Mode :character Mode :character Median :2.000 Median : 437.0
## Mean :1.697 Mean : 785.6
## 3rd Qu.:2.000 3rd Qu.: 942.0
## Max. :2.000 Max. :6392.0
## Top25perc Full_Under Part_Under In_State
## Min. : 9.00 Min. : 118 Min. : 1.0 Min. : 608
## 1st Qu.: 39.00 1st Qu.: 978 1st Qu.: 106.0 1st Qu.: 3076
## Median : 52.00 Median : 1707 Median : 414.0 Median : 9200
## Mean : 54.48 Mean : 3747 Mean : 953.1 Mean : 8866
## 3rd Qu.: 68.00 3rd Qu.: 4384 3rd Qu.: 1145.0 3rd Qu.:12660
## Max. :100.00 Max. :31643 Max. :21836.0 Max. :25750
## Out_State Room_Board Fees Books
## Min. : 2340 Min. :1560 Min. : 10.0 Min. : 96.0
## 1st Qu.: 6987 1st Qu.:3526 1st Qu.: 129.0 1st Qu.: 480.0
## Median : 9540 Median :4160 Median : 270.0 Median : 500.0
## Mean :10172 Mean :4323 Mean : 389.6 Mean : 549.4
## 3rd Qu.:12669 3rd Qu.:5050 3rd Qu.: 480.0 3rd Qu.: 600.0
## Max. :25750 Max. :8700 Max. :4374.0 Max. :2340.0
## Personal Faculty_PhD Faculty_Terminal SF_Ratio
## Min. : 75 Min. : 8.00 Min. : 20.0 Min. : 2.50
## 1st Qu.: 875 1st Qu.: 61.00 1st Qu.: 68.0 1st Qu.:11.50
## Median :1230 Median : 74.00 Median : 81.0 Median :13.60
## Mean :1370 Mean : 71.78 Mean : 78.5 Mean :14.16
## 3rd Qu.:1700 3rd Qu.: 85.00 3rd Qu.: 91.0 3rd Qu.:16.60
## Max. :6900 Max. :103.00 Max. :100.0 Max. :39.80
## Perc_Donate Instructional Grad_Rate Acc_Rate
## Min. : 0.00 Min. : 3186 Min. : 10.00 Min. :0.09139
## 1st Qu.:12.00 1st Qu.: 6591 1st Qu.: 52.00 1st Qu.:0.67283
## Median :20.00 Median : 8135 Median : 64.00 Median :0.77916
## Mean :21.93 Mean : 9590 Mean : 64.11 Mean :0.74508
## 3rd Qu.:30.00 3rd Qu.:10779 3rd Qu.: 77.00 3rd Qu.:0.85019
## Max. :64.00 Max. :62469 Max. :118.00 Max. :1.00000
We see that Faculty_PhD and Grad_Rate have maximum values exceeding 100; since these variables are percentages, we discard those observations.
data = data %>% filter(data$Faculty_PhD<=100 & data$Grad_Rate<=100)
Now that the data have reasonable values, we look for duplicated rows and, if there are any, discard them:
# duplicated colleges:
table(data$X)[which(table(data$X)>1)]
##
## Augustana College Bethany College Bethel College
## 2 2 3
## Columbia College Concordia University Judson College
## 2 2 2
## Monmouth College Northwestern College Saint Francis College
## 2 2 2
## Saint Joseph's College Trinity College Union College
## 2 3 2
## University of St. Thomas Westminster College Wheaton College
## 2 2 2
# cleaned data:
data = data %>% distinct(X, .keep_all = T)
The last thing to consider before starting with the models is that we are going to store some variables as profiles, both to interpret the models better later and to keep the dataset numeric. Those variables are X, Public_Private and State. Moreover, we are going to create a profile indicating how Elite a college is:
# college names:
data$X = as.factor(data$X)
row.names(data) = data$X
college.names = data$X
data$X = NULL
# public/private:
college.pub_priv = data$Public_Private
# states:
college.states = data$State
data$State = NULL
par(mfrow = c(1, 2))
barplot(table(college.states), col="#69b3a2", main="Distribution of Universities per State")
barplot(table(college.pub_priv), col="#69b3a2", main="Distribution of Universities per Type", sub = "1: Public / 2: Private")
We consider a university elite if Top25perc >= 90 and Acc_Rate <= 0.40 (such values show up as outliers in the boxplots). This tells us that being accepted at those universities is very hard and exclusive, and that almost all of their students were in the top 25% of their high school class.
par(mfrow = c(1, 2))
boxplot(data$Acc_Rate, col="#69b3a2", main = "Acc_Rate boxplot")
boxplot(data$Top25perc, col="#69b3a2", main = "Top25perc boxplot")
# elite universities:
getElite = function(df){
  # 1 if the college is hard to get into AND its intake is almost
  # entirely top-25% students, 0 otherwise:
  as.integer(df$Acc_Rate <= 0.40 & df$Top25perc >= 90)
}
college.elite = getElite(data)
barplot(table(college.elite), col="#69b3a2", main="Elite universities", sub = "0: Not Elite / 1: Elite")
Also, we are going to consider another profile, Region, that groups states in regions in the US, as established by the U.S. Census Bureau (https://www2.census.gov/geo/pdfs/maps-data/maps/reference/us_regdiv.pdf).
# regions:
groupUS = function(states){
for (i in 1:length(states)){
if (states[i] %in% c("CA", "OR", "WA", "NV", "ID", "MT", "WY", "UT", "CO", "AZ", "NM", "HI", "AK")){
states[i] = "West"
}
if (states[i] %in% c("PA", "NJ", "NY", "VT", "RI", "MA", "NH", "ME", "CT")){
states[i] = "NorthEast"
}
if (states[i] %in% c("TX", "OK", "AR", "LA", "MS", "AL", "TN", "KY", "WV", "DE", "MD", "DC", "VA", "NC", "SC", "GA", "FL")){
states[i] = "South"
}
if (states[i] %in% c("IL", "IN", "MI", "OH", "WI", "IA", "KS", "MN", "MO", "NE", "ND", "SD")){
states[i] = "North-Central"
}
}
states
}
college.regions = groupUS(college.states)
barplot(table(college.regions), col="#69b3a2")
library(usmap)
# maps of regions:
west_plot = usmap::plot_usmap(include = .west_region, labels = TRUE, fill = "red", alpha = 0.25)
south_plot = usmap::plot_usmap(include = .south_region, labels = TRUE, fill = "orange", alpha = 0.25)
northcentral_plot = usmap::plot_usmap(include = .north_central_region, labels = TRUE, fill = "green", alpha = 0.25)
northeast_plot = usmap::plot_usmap(include = .northeast_region, labels = TRUE, fill = "blue", alpha = 0.25)
# combined regions plot:
regions_plot = ggarrange(west_plot, south_plot, northcentral_plot, northeast_plot,
labels = c("West", "South", "North-Central", "North-East"),
ncol = 4, nrow = 1, align = "v"); regions_plot
It is interesting to consider the total cost (we will assume per semester), as is done in the following notebook: https://www.kaggle.com/lunamcbride24/college-exploration-and-regression. This will give us a better understanding later, when we apply the unsupervised learning techniques. First, note that Out_State and In_State tuition cannot simply be added together, so we take their mean. As expected, the variables that form the TotalCost variable are then removed.
# new variable:
getTotalCost = function(df){
  # mean of in-state and out-of-state tuition, plus room and board,
  # books, personal spending and additional fees
  # (note: mean(x, y) would NOT average x and y, since the second
  # argument of mean() is the trim fraction):
  (df$In_State + df$Out_State)/2 +
    df$Room_Board + df$Books + df$Personal + df$Fees
}
data$TotalCost = getTotalCost(data)
# remove variables:
data$In_State = NULL
data$Out_State = NULL
data$Room_Board = NULL
data$Fees = NULL
data$Books = NULL
data$Personal = NULL
We also create another variable, Total_Under, the total number of undergraduates: full-time (Full_Under) plus part-time (Part_Under):
# new variable:
getTotalUnder = function(df){
  # full-time plus part-time undergraduates:
  df$Full_Under + df$Part_Under
}
data$Total_Under = getTotalUnder(data)
# remove variables:
data$Full_Under = NULL
data$Part_Under = NULL
It is important to detect outliers properly, and box plots are well suited for this. We see that Acc_Rate, Enrolled, Instructional and Total_Under have many outliers. However, after several tests, the decision is to keep them, because these colleges will be useful later for grouping and ranking universities.
data %>%
  as_tibble() %>%
  gather(variable, value) %>%
  ggplot(aes(x=value)) +
  geom_boxplot(fill= "#69b3a2") + facet_wrap(~variable, scale="free")
Before creating the models, we make a copy of data without the Public/Private variable, in order to compare the outputs of the models.
# data without Public/Private:
out_data = data %>% select(-1)
It is important to dedicate some time to exploring the variables in our dataset before starting with the models.
summary(data)
## Public_Private Enrolled Top25perc Faculty_PhD
## Min. :1.000 Min. : 32.0 Min. : 9.0 Min. : 8.00
## 1st Qu.:1.000 1st Qu.: 239.2 1st Qu.: 39.0 1st Qu.: 61.00
## Median :2.000 Median : 443.5 Median : 52.5 Median : 74.00
## Mean :1.692 Mean : 796.1 Mean : 54.6 Mean : 71.91
## 3rd Qu.:2.000 3rd Qu.: 959.5 3rd Qu.: 68.0 3rd Qu.: 85.00
## Max. :2.000 Max. :6392.0 Max. :100.0 Max. :100.00
## Faculty_Terminal SF_Ratio Perc_Donate Instructional
## Min. : 20.00 Min. : 2.50 Min. : 0.00 Min. : 3186
## 1st Qu.: 68.00 1st Qu.:11.50 1st Qu.:12.00 1st Qu.: 6568
## Median : 81.00 Median :13.65 Median :20.00 Median : 8135
## Mean : 78.58 Mean :14.18 Mean :21.86 Mean : 9595
## 3rd Qu.: 91.00 3rd Qu.:16.68 3rd Qu.:30.00 3rd Qu.:10804
## Max. :100.00 Max. :39.80 Max. :64.00 Max. :62469
## Grad_Rate Acc_Rate TotalCost Total_Under
## Min. : 10.00 Min. :0.09139 Min. : 6260 Min. : 142
## 1st Qu.: 52.00 1st Qu.:0.67070 1st Qu.:13300 1st Qu.: 1230
## Median : 64.00 Median :0.77903 Median :15984 Median : 2240
## Mean : 63.99 Mean :0.74398 Mean :16790 Mean : 4767
## 3rd Qu.: 77.00 3rd Qu.:0.84986 3rd Qu.:19869 3rd Qu.: 5633
## Max. :100.00 Max. :1.00000 Max. :31560 Max. :38338
data %>%
  as_tibble() %>%
  gather(variable, value) %>%
  ggplot(aes(x=value)) +
  geom_histogram(fill= "#69b3a2") + facet_wrap(~variable, scale="free")
We can see that variability differs greatly across variables: it is low in Instructional and Enrolled but high in Grad_Rate and Top25perc, which suggests they could be good variables for ranking observations. We can also see that the maximum of Acc_Rate is 1, which is surprising: some colleges accept every applicant.
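As a quick sanity check, the colleges that accept every applicant can be listed directly (using the row names set in the Profiles section):

```r
# universities with an acceptance rate of exactly 1:
row.names(data)[data$Acc_Rate == 1]
```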
Principal Component Analysis (PCA) is an unsupervised learning tool that lets us reduce the dimensionality of the data.
We have seen that the variables are on very different scales (percentages, ratios, total cost, etc.), so we scale the data. PCA will then maximize the explained variance, which we can use to rank observations.
boxplot(scale(data), las=2, col="#69b3a2")
It is important to examine the correlation matrix in order to detect hidden correlations.
library(GGally)
ggcorr(data, label = T)
As expected, Faculty_Terminal and Faculty_PhD are strongly correlated, basically because faculty with a PhD are also counted as having a terminal degree. The same happens with Enrolled and Total_Under. For this reason, we drop Faculty_PhD and Enrolled. We also see some natural strong correlations, such as Instructional with TotalCost (if the college spends more per student, the total cost tends to be higher) and, negatively, Total_Under with Public/Private (public universities tend to have more students).
data$Enrolled = NULL
data$Faculty_PhD = NULL
ggcorr(data, label = T)
# model:
pca = prcomp(data, scale=T)
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.0162 1.4512 0.89580 0.86563 0.73314 0.69389 0.63434
## Proportion of Variance 0.4065 0.2106 0.08025 0.07493 0.05375 0.04815 0.04024
## Cumulative Proportion 0.4065 0.6171 0.69738 0.77231 0.82606 0.87421 0.91445
## PC8 PC9 PC10
## Standard deviation 0.59235 0.55129 0.44805
## Proportion of Variance 0.03509 0.03039 0.02008
## Cumulative Proportion 0.94953 0.97992 1.00000
Checking the screeplot, we notice that around 60% of the total variance is explained by the first two components. This will be sufficient for our purposes, so we only consider two components.
fviz_screeplot(pca, addlabels = TRUE, barfill = "#69b3a2", barcolor = "#69b3a2")
We can also plot the loadings and notice that there may be two differentiated clusters. But which are they? We could guess they correspond to public and private universities; we will confirm this later.
library(ggfortify)
pca_plot = autoplot(pca, data = data,
label = TRUE, label.size = 3, shape = FALSE,
loadings = TRUE, loadings.colour = 'blue',
loadings.label = TRUE, loadings.label.size = 5)
library(plotly)
plotly::ggplotly(pca_plot)
First component:
How good is a University
fviz_contrib(pca, choice = "var", axes = 1, fill = "#69b3a2", color = "#69b3a2")
A minus sign is included just to make the plot easier to interpret:
barplot(-pca$rotation[,1], las=2, col="#69b3a2")
In this plot we interpret the signs as usual: the more negative the bar, the lower the value, and vice versa. SF_Ratio, Acc_Rate and Total_Under have the opposite sign to the rest. This matches what we would expect: a low SF_Ratio means teaching is more personal, as there are few students per teacher, and small values of Acc_Rate and Total_Under mean that getting accepted is very difficult and the university has a small number of students.
On the other hand, the values of TotalCost, Instructional, Grad_Rate, Top25perc and the remaining variables are high: the total cost of a semester is high, the percentage of students from the top 25% of their class is high, and so is the graduation rate.
Taking all of this into account, these are the features associated with "better" universities. So this first PC lets us rank universities by a notion of how good the college is.
Top 25 Universities:
head(college.names[order(pca$x[,1])], 25)
## [1] California Institute of Technology Johns Hopkins University
## [3] Yale University Harvard University
## [5] Dartmouth College Princeton University
## [7] Washington University University of Chicago
## [9] Amherst College Duke University
## [11] Williams College Massachusetts Institute of Technology
## [13] Wake Forest University Stanford University
## [15] Swarthmore College Cooper Union
## [17] Columbia University Middlebury College
## [19] University of Pennsylvania Harvey Mudd College
## [21] Bowdoin College Brown University
## [23] Emory University Pomona College
## [25] Wellesley College
## 938 Levels: Abilene Christian University Adelphi University ... Youngstown State University
Bottom 25 Universities:
tail(college.names[order(pca$x[,1])], 25)
## [1] Henderson State University
## [2] Northern Michigan University
## [3] Barber Scotia College
## [4] Northwest Missouri State University
## [5] Chadron State College
## [6] Wayne State College
## [7] Youngstown State University
## [8] Western New Mexico University
## [9] Indiana Univ. Northwest
## [10] Prairie View A. and M. University
## [11] Indiana Wesleyan University
## [12] University of Central Arkansas
## [13] West Liberty State College
## [14] West Texas A&M University
## [15] Arkansas Tech University
## [16] University of Southern Indiana
## [17] University of Texas at Arlington
## [18] East Central University
## [19] Weber State University
## [20] University of Sci. and Arts of Oklahoma
## [21] Metropolitan State College
## [22] Goldey Beacom College
## [23] Northeast Louisiana University
## [24] Mesa State College
## [25] Black Hills State University
## 938 Levels: Abilene Christian University Adelphi University ... Youngstown State University
Second component:
How demanding/popular and public is a University
fviz_contrib(pca, choice = "var", axes = 2, fill = "#69b3a2", color = "#69b3a2")
barplot(pca$rotation[,2], las=2, col="#69b3a2")
We see that the variables Total_Under and Public/Private are the most important in this second PC; however, their signs differ. This tells us that the number of students at a college is crucial in this PC, and the Public/Private loading can be interpreted as how public the university is.
For these reasons, this second PC indicates how demanding/popular and how public a university is. We can now rank the universities by their second PC scores:
Most demanding Universities:
head(college.names[order(pca$x[,2])], 25)
## [1] University of Texas at Austin
## [2] Pennsylvania State Univ. Main Campus
## [3] Texas A&M Univ. at College Station
## [4] University of Minnesota Twin Cities
## [5] University of California at Los Angeles
## [6] University of California at Berkeley
## [7] Rutgers at New Brunswick
## [8] Ohio State University at Columbus
## [9] University of Florida
## [10] University of Central Florida
## [11] University of Washington
## [12] University of South Florida
## [13] University of Illinois - Urbana
## [14] Arizona State University Main campus
## [15] Florida State University
## [16] Michigan State University
## [17] Indiana University at Bloomington
## [18] North Carolina State University at Raleigh
## [19] University of Houston-Main Campus
## [20] University of Michigan at Ann Arbor
## [21] University of Wisconsin at Madison
## [22] Purdue University at West Lafayette
## [23] University of California at San Diego
## [24] San Diego State University
## [25] University of California at Davis
## 938 Levels: Abilene Christian University Adelphi University ... Youngstown State University
Least demanding Universities:
tail(college.names[order(pca$x[,2])], 25)
## [1] Dana College Lees-McRae College
## [3] St. Paul's College Wisconsin Lutheran College
## [5] Salem-Teikyo University Teikyo Post University
## [7] Pacific Christian College Molloy College
## [9] Mount Mary College Brescia College
## [11] Adelphi University Mount Marty College
## [13] Curry College College of St. Joseph
## [15] Dominican College of Blauvelt Alvernia College
## [17] Lasell College Siena Heights College
## [19] Montana State University - Northern Barber Scotia College
## [21] Marian College of Fond du Lac New England College
## [23] MidAmerica Nazarene College Mount Saint Clare College
## [25] Lourdes College
## 938 Levels: Abilene Christian University Adelphi University ... Youngstown State University
Having computed the PCA with the Public/Private variable, we now do the same without it:
Again, it is important to examine the correlation matrix in order to detect hidden correlations.
library(GGally)
ggcorr(out_data, label = T)
As expected, Faculty_Terminal and Faculty_PhD are strongly correlated, basically because faculty with a PhD are also counted as having a terminal degree. The same happens with Enrolled and Total_Under. For this reason, we drop Faculty_PhD and Enrolled.
out_data$Enrolled = NULL
out_data$Faculty_PhD = NULL
ggcorr(out_data, label = T)
# model:
pca_out = prcomp(out_data, scale=T)
summary(pca_out)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.9653 1.2693 0.89578 0.86488 0.72732 0.68464 0.63098
## Proportion of Variance 0.4291 0.1790 0.08916 0.08311 0.05878 0.05208 0.04424
## Cumulative Proportion 0.4291 0.6081 0.69730 0.78041 0.83919 0.89127 0.93551
## PC8 PC9
## Standard deviation 0.57250 0.50266
## Proportion of Variance 0.03642 0.02807
## Cumulative Proportion 0.97193 1.00000
fviz_screeplot(pca_out, addlabels = TRUE, barfill = "#69b3a2", barcolor = "#69b3a2")
library(ggfortify)
pca_plot = autoplot(pca_out, data = out_data,
label = TRUE, label.size = 3, shape = FALSE,
loadings = TRUE, loadings.colour = 'blue',
loadings.label = TRUE, loadings.label.size = 5)
library(plotly)
plotly::ggplotly(pca_plot)
We can see that the observations are not as separated as they were with the Public/Private variable. This is because Total_Under is now more important, and we saw previously in 2.7 that its variability is low, which indicates that observations are very similar in this variable.
First component:
How **good** is a University
fviz_contrib(pca_out, choice = "var", axes = 1, fill = "#69b3a2", color = "#69b3a2")
barplot(pca_out$rotation[,1], las=2, col="#69b3a2")
We see that the interpretation of this first PC without Public/Private is the same as with it, except that the sign of the component is flipped (PC signs are arbitrary), so here low scores correspond to the worst universities. Again we can rank the universities by their first PC scores:
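The sign of each principal component returned by prcomp is arbitrary, so the agreement between the two models can be checked by correlating their scores; a sketch, where a value close to -1 would confirm that the two first components are flipped versions of each other:

```r
# correlation between the first PC scores of the two PCA models;
# a value near -1 means the components point in opposite directions:
cor(pca$x[, 1], pca_out$x[, 1])
```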
Bottom 25 Universities:
head(college.names[order(pca_out$x[,1])], 25)
## [1] Black Hills State University
## [2] Goldey Beacom College
## [3] Mesa State College
## [4] Northeast Louisiana University
## [5] Barber Scotia College
## [6] Indiana Wesleyan University
## [7] University of Sci. and Arts of Oklahoma
## [8] Metropolitan State College
## [9] East Central University
## [10] Friends University
## [11] Weber State University
## [12] West Liberty State College
## [13] Voorhees College
## [14] University of Southern Indiana
## [15] Arkansas Tech University
## [16] Lee College
## [17] Brewton-Parker College
## [18] University of Texas at Arlington
## [19] West Texas A&M University
## [20] Western New Mexico University
## [21] University of Central Arkansas
## [22] Prairie View A. and M. University
## [23] Wayne State College
## [24] Chadron State College
## [25] Indiana Univ. Northwest
## 938 Levels: Abilene Christian University Adelphi University ... Youngstown State University
Top 25 Universities:
tail(college.names[order(pca_out$x[,1])], 25)
## [1] Wellesley College Pomona College
## [3] Emory University Harvey Mudd College
## [5] Bowdoin College Brown University
## [7] Middlebury College University of Pennsylvania
## [9] Columbia University Cooper Union
## [11] Swarthmore College Wake Forest University
## [13] Stanford University Williams College
## [15] Massachusetts Institute of Technology Duke University
## [17] Amherst College University of Chicago
## [19] Washington University Princeton University
## [21] Dartmouth College Harvard University
## [23] Yale University Johns Hopkins University
## [25] California Institute of Technology
## 938 Levels: Abilene Christian University Adelphi University ... Youngstown State University
Second component:
How **demanding/popular** a University is:
fviz_contrib(pca_out, choice = "var", axes = 2, fill = "#69b3a2", color = "#69b3a2")
barplot(pca_out$rotation[,2], las=2, col="#69b3a2")
We see that Total_Under is the most important variable in this second PC. This is a significant difference, since the Public/Private variable is no longer present. So this component indicates how demanding or popular a university is.
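As a quick numerical check of this reading, we can list the variables with the largest absolute loadings on the second component (a sketch, assuming the `pca_out` object fitted above with `prcomp`):

```r
# Variables ranked by absolute loading on PC2 (the sign of a PC is arbitrary)
head(sort(abs(pca_out$rotation[, 2]), decreasing = TRUE), 5)
```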
Now we can rank the universities by their second PC scores:
Most demanding Universities:
head(college.names[order(pca_out$x[,2])], 25)
## [1] University of Texas at Austin
## [2] Texas A&M Univ. at College Station
## [3] Pennsylvania State Univ. Main Campus
## [4] University of Minnesota Twin Cities
## [5] University of Central Florida
## [6] Arizona State University Main campus
## [7] Ohio State University at Columbus
## [8] Rutgers at New Brunswick
## [9] University of California at Berkeley
## [10] Florida State University
## [11] University of South Florida
## [12] University of California at Los Angeles
## [13] University of North Texas
## [14] University of Florida
## [15] University of Houston-Main Campus
## [16] Brigham Young University at Provo
## [17] Indiana University at Bloomington
## [18] Michigan State University
## [19] University of Illinois - Urbana
## [20] San Diego State University
## [21] Purdue University at West Lafayette
## [22] North Carolina State University at Raleigh
## [23] University of Washington
## [24] University of Texas at San Antonio
## [25] Texas Tech University
## 938 Levels: Abilene Christian University Adelphi University ... Youngstown State University
Least demanding Universities:
tail(college.names[order(pca_out$x[,2])], 25)
## [1] Wheelock College Mount Marty College
## [3] Alvernia College McPherson College
## [5] Adelphi University MidAmerica Nazarene College
## [7] Salem-Teikyo University Wisconsin Lutheran College
## [9] Mary Baldwin College Molloy College
## [11] College of Mount St. Joseph Mount Mary College
## [13] Siena Heights College Marian College of Fond du Lac
## [15] College of St. Joseph Goshen College
## [17] Pine Manor College Mount Saint Clare College
## [19] Curry College Lourdes College
## [21] Principia College Brescia College
## [23] New England College Lasell College
## [25] Montana State University - Northern
## 938 Levels: Abilene Christian University Adelphi University ... Youngstown State University
Once we have fitted the different PCA models, we can compare them and select the configuration that best suits our goals. We see that both models are very similar in terms of the meaning of the principal components. In this case the preferred choice is:
Data with Public/Private (4.1.), because this variable allows us to rank universities by a quality indicator.
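A compact way to back this choice up is to put the explained-variance figures of the two models side by side. This is a sketch assuming the `pca` (with Public/Private) and `pca_out` (without) objects fitted earlier with `prcomp`:

```r
# Cumulative proportion of variance of the first three PCs for both models
cum_with    = summary(pca)$importance["Cumulative Proportion", 1:3]
cum_without = summary(pca_out)$importance["Cumulative Proportion", 1:3]
round(rbind(with_PubPriv = cum_with, without_PubPriv = cum_without), 3)
```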
Factor Analysis is an analytical tool whose focus is to explain the correlations between indicators through a small number of common factors, thereby reducing the dimension of the data.
As with PCA, we will consider two approaches: with and without the Public/Private indicator. Moreover, for each one we will consider both two and three factors.
f = 2
r = "varimax"
s = "Bartlett"
data_fa = factanal(data, factors = f, rotation=r, scores=s)
data_fa
##
## Call:
## factanal(x = data, factors = f, scores = s, rotation = r)
##
## Uniquenesses:
## Public_Private Top25perc Faculty_Terminal SF_Ratio
## 0.182 0.450 0.487 0.557
## Perc_Donate Instructional Grad_Rate Acc_Rate
## 0.632 0.394 0.571 0.715
## TotalCost Total_Under
## 0.257 0.444
##
## Loadings:
## Factor1 Factor2
## Public_Private 0.904
## Top25perc 0.738
## Faculty_Terminal 0.694 -0.180
## SF_Ratio -0.376 -0.549
## Perc_Donate 0.411 0.446
## Instructional 0.726 0.282
## Grad_Rate 0.512 0.409
## Acc_Rate -0.532
## TotalCost 0.695 0.510
## Total_Under 0.254 -0.701
##
## Factor1 Factor2
## SS loadings 2.956 2.356
## Proportion Var 0.296 0.236
## Cumulative Var 0.296 0.531
##
## Test of the hypothesis that 2 factors are sufficient.
## The chi square statistic is 360.77 on 26 degrees of freedom.
## The p-value is 1.21e-60
We can see that the proportion of total variance explained is 53%. The first factor captures how good a university is, while the second factor captures its popularity.
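To visualize this interpretation, we can plot the two loading vectors side by side (a sketch assuming the `data_fa` object from the chunk above):

```r
# Loadings of each factor; Factor 1 ~ quality, Factor 2 ~ popularity
par(mfrow = c(1, 2))
barplot(data_fa$loadings[, 1], las = 2, col = "#69b3a2", main = "Factor 1")
barplot(data_fa$loadings[, 2], las = 2, col = "#69b3a2", main = "Factor 2")
par(mfrow = c(1, 1))
```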
f = 2
r = "none"
s = "regression"
data_fa = factanal(data, factors = f, rotation=r, scores=s)
data_fa
##
## Call:
## factanal(x = data, factors = f, scores = s, rotation = r)
##
## Uniquenesses:
## Public_Private Top25perc Faculty_Terminal SF_Ratio
## 0.182 0.450 0.487 0.557
## Perc_Donate Instructional Grad_Rate Acc_Rate
## 0.632 0.394 0.571 0.715
## TotalCost Total_Under
## 0.257 0.444
##
## Loadings:
## Factor1 Factor2
## Public_Private 0.717 -0.551
## Top25perc 0.520 0.529
## Faculty_Terminal 0.294 0.654
## SF_Ratio -0.664
## Perc_Donate 0.605
## Instructional 0.674 0.390
## Grad_Rate 0.639 0.144
## Acc_Rate -0.304 -0.438
## TotalCost 0.833 0.223
## Total_Under -0.388 0.637
##
## Factor1 Factor2
## SS loadings 3.477 1.835
## Proportion Var 0.348 0.184
## Cumulative Var 0.348 0.531
##
## Test of the hypothesis that 2 factors are sufficient.
## The chi square statistic is 360.77 on 26 degrees of freedom.
## The p-value is 1.21e-60
But with this configuration we also get exactly the same total explained variance.
For this reason we will also consider three factors. But first, let's plot the two factors in a 2D graph to see if there are any hidden insights:
fa_plot = autoplot(data_fa, label = TRUE, label.size = 3, shape = FALSE,
loadings = TRUE, loadings.label = TRUE, loadings.label.size = 5)
plotly::ggplotly(fa_plot)
It is very interesting how the Factor Analysis model looks in this plot. We can see two differentiated groups that, according to the loadings, correspond to the Public_Private indicator. Let's check it:
fa_plot = autoplot(data_fa, label = TRUE, label.size = 3, shape = FALSE, colour = college.pub_priv, legend = TRUE,
loadings = TRUE, loadings.label = TRUE, loadings.label.size = 5)
plotly::ggplotly(fa_plot)
Previously we saw that with two factors the total variance explained by the model was around 53% (independently of the rotation and scores used). For this reason we will consider three factors to see if the fit improves.
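Before committing to a configuration, a small loop can scan several numbers of factors at once. This is a hedged sketch assuming the `data` object used in the chunks above:

```r
# Cumulative variance explained and sufficiency-test p-value for f = 2, 3, 4
for (f in 2:4) {
  fa = factanal(data, factors = f)
  cat(f, "factors: variance =", round(sum(fa$loadings^2) / ncol(data), 3),
      ", p-value =", signif(fa$PVAL, 3), "\n")
}
```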
f = 3
r = "varimax"
s = "Bartlett"
data_fa = factanal(data, factors = f, rotation=r, scores=s)
data_fa
##
## Call:
## factanal(x = data, factors = f, scores = s, rotation = r)
##
## Uniquenesses:
## Public_Private Top25perc Faculty_Terminal SF_Ratio
## 0.171 0.419 0.457 0.515
## Perc_Donate Instructional Grad_Rate Acc_Rate
## 0.605 0.005 0.513 0.712
## TotalCost Total_Under
## 0.280 0.450
##
## Loadings:
## Factor1 Factor2 Factor3
## Public_Private 0.901
## Top25perc 0.708 0.283
## Faculty_Terminal 0.664 -0.251 0.196
## SF_Ratio -0.223 -0.487 -0.445
## Perc_Donate 0.454 0.402 0.164
## Instructional 0.401 0.149 0.901
## Grad_Rate 0.581 0.362 0.138
## Acc_Rate -0.400 0.105 -0.341
## TotalCost 0.622 0.424 0.391
## Total_Under 0.183 -0.718
##
## Factor1 Factor2 Factor3
## SS loadings 2.287 2.134 1.450
## Proportion Var 0.229 0.213 0.145
## Cumulative Var 0.229 0.442 0.587
##
## Test of the hypothesis that 3 factors are sufficient.
## The chi square statistic is 172.22 on 18 degrees of freedom.
## The p-value is 3.31e-27
We can see that the proportion of total variance explained is 59%.
f = 3
r = "none"
s = "regression"
data_fa = factanal(data, factors = f, rotation=r, scores=s)
data_fa
##
## Call:
## factanal(x = data, factors = f, scores = s, rotation = r)
##
## Uniquenesses:
## Public_Private Top25perc Faculty_Terminal SF_Ratio
## 0.171 0.419 0.457 0.515
## Perc_Donate Instructional Grad_Rate Acc_Rate
## 0.605 0.005 0.513 0.712
## TotalCost Total_Under
## 0.280 0.450
##
## Loadings:
## Factor1 Factor2 Factor3
## Public_Private 0.260 0.872
## Top25perc 0.546 0.530
## Faculty_Terminal 0.410 -0.266 0.551
## SF_Ratio -0.570 -0.400
## Perc_Donate 0.401 0.375 0.307
## Instructional 0.997
## Grad_Rate 0.424 0.343 0.436
## Acc_Rate -0.454 0.158 -0.237
## TotalCost 0.677 0.358 0.366
## Total_Under -0.704 0.232
##
## Factor1 Factor2 Factor3
## SS loadings 2.858 1.899 1.115
## Proportion Var 0.286 0.190 0.111
## Cumulative Var 0.286 0.476 0.587
##
## Test of the hypothesis that 3 factors are sufficient.
## The chi square statistic is 172.22 on 18 degrees of freedom.
## The p-value is 3.31e-27
The result is that it explains 59% of the total variance, only slightly higher than with two factors; not enough to say that it is a better model than the two-factor one.
Once we have computed the models considering the Public/Private variable, we can do the same without it:
f = 2
r = "varimax"
s = "Bartlett"
out_data_fa = factanal(out_data, factors = f, rotation=r, scores=s)
out_data_fa
##
## Call:
## factanal(x = out_data, factors = f, scores = s, rotation = r)
##
## Uniquenesses:
## Top25perc Faculty_Terminal SF_Ratio Perc_Donate
## 0.426 0.496 0.516 0.626
## Instructional Grad_Rate Acc_Rate TotalCost
## 0.381 0.580 0.723 0.293
## Total_Under
## 0.439
##
## Loadings:
## Factor1 Factor2
## Top25perc 0.372 0.660
## Faculty_Terminal 0.148 0.694
## SF_Ratio -0.692
## Perc_Donate 0.586 0.173
## Instructional 0.611 0.496
## Grad_Rate 0.579 0.291
## Acc_Rate -0.224 -0.476
## TotalCost 0.733 0.412
## Total_Under -0.520 0.539
##
## Factor1 Factor2
## SS loadings 2.550 1.970
## Proportion Var 0.283 0.219
## Cumulative Var 0.283 0.502
##
## Test of the hypothesis that 2 factors are sufficient.
## The chi square statistic is 308.44 on 19 degrees of freedom.
## The p-value is 3.71e-54
f = 2
r = "none"
s = "Bartlett"
out_data_fa = factanal(out_data, factors = f, rotation=r, scores=s)
out_data_fa
##
## Call:
## factanal(x = out_data, factors = f, scores = s, rotation = r)
##
## Uniquenesses:
## Top25perc Faculty_Terminal SF_Ratio Perc_Donate
## 0.426 0.496 0.516 0.626
## Instructional Grad_Rate Acc_Rate TotalCost
## 0.381 0.580 0.723 0.293
## Total_Under
## 0.439
##
## Loadings:
## Factor1 Factor2
## Top25perc 0.695 0.303
## Faculty_Terminal 0.536 0.465
## SF_Ratio -0.595 0.360
## Perc_Donate 0.572 -0.215
## Instructional 0.786
## Grad_Rate 0.638 -0.117
## Acc_Rate -0.466 -0.245
## TotalCost 0.833 -0.112
## Total_Under 0.744
##
## Factor1 Factor2
## SS loadings 3.395 1.124
## Proportion Var 0.377 0.125
## Cumulative Var 0.377 0.502
##
## Test of the hypothesis that 2 factors are sufficient.
## The chi square statistic is 308.44 on 19 degrees of freedom.
## The p-value is 3.71e-54
Both models explain 50% of the variance with two factors. The first factor again captures how good a university is, while the second factor captures its popularity.
Let's see the plot:
fa_plot = autoplot(out_data_fa, label = TRUE, label.size = 3, shape = FALSE,
loadings = TRUE, loadings.label = TRUE, loadings.label.size = 5)
plotly::ggplotly(fa_plot)
The grouping is now not as clear as with the Public/Private data. Still, it is worth remarking that the observations form a roughly square shape, whose branches are determined by the most important variables in each factor, e.g. Total_Under and TotalCost.
We have considered 2 variations of the data (with and without Public/Private), 2 numbers of factors (2 and 3) and 2 configurations of rotation and scores: 8 different models in total. However, the results were very similar in terms of total variance explained. For our purpose, the choice is:
Two factors with rotation = “varimax” and scores = “Bartlett”, because the explained variance is 53% and in the plot we saw that the colleges are grouped well according to Public/Private.
The final unsupervised learning tool that we will use in this project is clustering. The focus is to reduce the dimensionality of the observations by grouping them into clusters. The goal is to maximize the distance between different clusters while minimizing the distance between observations within the same cluster.
We will consider different clustering approaches:
- K-Means
- Mahalanobis K-Means
- Hierarchical
- PAM
- Kernel K-Means
First, we load some crucial libraries for clustering:
library(factoextra)
library(cluster)
library(mclust)
Because clustering relies on distances, it is advisable to scale the data first.
X_data = scale(data)
It is also important to determine the optimal number of clusters. We make some plots using the silhouette and WSS methods; we see that we could choose either 2 or 3 (gap_stat is not considered because it is more time-consuming and the results of the previous methods are acceptable).
fviz_nbclust(X_data, kmeans, method = "silhouette", nstart = 1000, linecolor = "#69b3a2")
fviz_nbclust(X_data, kmeans, method = "wss", nstart = 1000, linecolor = "#69b3a2")
Before starting with the models, let's look at the PCA plot to see how the observations will be placed and to interpret the possible clusters through the loadings:
library(ggfortify)
pca_plot = autoplot(pca, data = data,
label = TRUE, label.size = 3, shape = FALSE,
loadings = TRUE, loadings.colour = 'red',
loadings.label = TRUE, loadings.label.size = 5)
library(plotly)
plotly::ggplotly(pca_plot)
Above average universities vs Below average universities.
# K=2:
k = 2
km2 = kmeans(scale(data), center = k, nstart = 1000)
fviz_cluster(km2, data = scale(data), geom = c("point"),ellipse.type = 'norm', pointsize=1, main = "K-Means clusplot with 2 clusters")+
theme_minimal()+geom_text(label=row.names(data),hjust=0, vjust=0,size=2,check_overlap = T)+scale_fill_brewer(palette="Greens")
We can see two well differentiated groups. According to the interpretation of the PCs in 4.1., the two clusters are differentiated mainly by the first component: further to the left indicates a college better than the average, and further to the right the opposite.
Let's see how good the groups are:
As we can see in both plots, the groups can be considered good in terms of their silhouette widths. On the other hand, they are not that well balanced, but this seems acceptable.
barplot(table(km2$cluster), col="#69b3a2", main = "Number of colleges per cluster", sub = " Hypothesis: 1 Above Av. / 2 Below Av.")
d = dist(data, method="euclidean")
data_sil = silhouette(km2$cluster, d)
plot(data_sil, col=1:2, main="", border=NA)
Above average universities vs Below average public universities vs Below average private universities.
#K=3:
k = 3
km3 = kmeans(scale(data), center = k, nstart = 1000)
fviz_cluster(km3, data = scale(data), geom = c("point"),ellipse.type = 'norm', pointsize=1, main = "K-Means clusplot with 3 clusters")+
theme_minimal()+geom_text(label=row.names(data),hjust=0, vjust=0,size=2,check_overlap = T)+scale_fill_brewer(palette="Greens")
We can see three differentiated groups. According to the interpretation of the PCs in 4.1., the three clusters are interpreted similarly to the two-cluster case: further to the left implies a better college, and further up indicates private.
Let's see how good the groups are:
As we can see in both plots, the groups can also be considered good in terms of their silhouette widths. On the other hand, they are not that well balanced, but as with two clusters this seems acceptable.
barplot(table(km3$cluster), col="#69b3a2", main = "Number of colleges per cluster", sub = "Hypothesis: 1 Below Av. Private / 2 Below Av.Public / 3 Above Av")
d = dist(data, method="euclidean")
data_sil = silhouette(km3$cluster, d)
plot(data_sil, col=1:3, main="", border=NA)
Now we can change one input of the K-Means model: the distance between observations. In this case we use the Mahalanobis distance. R has no specific library or function to apply it directly, so we construct it step by step:
S_x = cov(data) # covariance matrix
inv_Sx = solve(S_x) # S_x^-1 (inverse of S_x)
eigen_Sx = eigen(inv_Sx)
vector_Sx <- eigen_Sx$vectors
Y = vector_Sx %*% diag(sqrt(eigen_Sx$values)) %*% t(vector_Sx) # inv_Sx^1/2
X_til = scale(data, scale = FALSE)
data_Sx = X_til %*% Y
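A quick sanity check of the transformation: after this whitening, the sample covariance of `data_Sx` should be approximately the identity matrix, so ordinary Euclidean distances on `data_Sx` coincide with Mahalanobis distances on the original data (assuming the `data_Sx` object constructed above):

```r
# Should print ~1 on the diagonal and ~0 elsewhere
round(cov(data_Sx)[1:4, 1:4], 3)
```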
Private universities vs Public universities
Let's consider two clusters first:
k = 2
km2.mahalanobis = kmeans(data_Sx, centers=k, nstart=1000)
In the clusplot we see that the two clusters are well differentiated. Taking into account the interpretation of the PCs, we realize that these clusters actually represent private and public universities.
fviz_cluster(km2.mahalanobis, data, geom = c("point"),ellipse.type = 'norm', pointsize=1, main = "K-Means with Mahalanobis distance clusplot with 2 clusters")+
theme_minimal()+geom_text(label=college.names,hjust=0, vjust=0,size=2,check_overlap = T)+scale_fill_brewer(palette="Greens")
This interpretation is confirmed by the following plot: in cluster 1 most of the colleges are private, and in cluster 2 we have the public universities:
as.data.frame(X_data) %>% mutate(cluster=factor(km2.mahalanobis$cluster), names=college.names, pub_priv=college.pub_priv) %>%
ggplot(aes(x=as.factor(pub_priv), fill=as.factor(pub_priv) )) + geom_bar( ) + facet_wrap(~cluster) + scale_fill_brewer(palette = "Greens") + labs(title = "Distribution of Public (1) and Private (2) by cluster", x = "", y = "", fill = "Type (1: Public / 2: Private)")
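The same point can be made numerically with a contingency table of cluster against type (a sketch assuming the `km2.mahalanobis` and `college.pub_priv` objects above):

```r
# Rows: Mahalanobis K-Means cluster; columns: Public (1) / Private (2)
table(cluster = km2.mahalanobis$cluster, type = college.pub_priv)
```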
Above Av. Private universities vs Below Av. Private universities vs Public universities
Let's now consider 3 clusters and compare the results with the 2-cluster solution:
k = 3
km3.mahalanobis = kmeans(data_Sx, centers=k, nstart=1000)
In this case the interpretation is not as easy as with two clusters.
fviz_cluster(km3.mahalanobis, data, geom = c("point"),ellipse.type = 'norm', pointsize=1, main = "K-Means with Mahalanobis distance clusplot with 3 clusters")+
theme_minimal()+geom_text(label=college.names,hjust=0, vjust=0,size=2,check_overlap = T)+scale_fill_brewer(palette="Greens")
However, if we recall the 3-cluster K-Means plot in 6.1.1., we can see that this solution is similar but not identical. In this case we have:
adjustedRandIndex(km3.mahalanobis$cluster, km3$cluster)
## [1] 0.5567912
Another clustering tool, different from the ones we have already tested, is Hierarchical Clustering. Here the focus is to impose a hierarchy on the observations, represented in a dendrogram: close observations are merged step by step until all observations form a single group. A downside of this tool is that large data sets are difficult to manage, both in running time and in interpreting the results. For this reason:
The code is commented out to avoid long run times and visualization problems.
Note, however, that we considered different distances and linkage methods, and that the optimal number of clusters is k = 2. Phylogenetic trees are not shown either, for the same reason as the dendrograms.
# dis = "euclidean"
# met = "complete"
# d = dist(data, method = dis)
# hc = hclust(d, method = met)
# fviz_dend(hc, k = 2, cex = 0.6, rect = T, repel = T)
# dis = "manhattan"
# met = "single"
# d = dist(data, method = dis)
# hc = hclust(d, method = met)
# fviz_dend(hc, k = 2, cex = 0.6, rect = T, repel = T)
# dis = "canberra"
# met = "average"
# d = dist(data, method = dis)
# hc = hclust(d, method = met)
# fviz_dend(hc, k = 2, cex = 0.6, rect = T, repel = T)
As before, we determine the optimal number of clusters using the silhouette and WSS methods. We see that we could choose either 2 or 3 (gap_stat is not considered because it is more time-consuming and the results of the previous methods are acceptable).
fviz_nbclust(X_data, kmeans, method = "silhouette", nstart = 1000, linecolor = "#69b3a2")
fviz_nbclust(X_data, kmeans, method = "wss", nstart = 1000, linecolor = "#69b3a2")
Private universities vs Public universities
data_pam2 = eclust(X_data, "pam", stand=TRUE, k=2, graph=F)
fviz_cluster(data_pam2, data = X_data, geom = c("point"), pointsize=1, main = "PAM clusplot with 2 clusters")+
theme_minimal()+geom_text(label=college.names,hjust=0, vjust=0,size=2,check_overlap = T)+scale_fill_brewer(palette="Greens")
With 2 clusters, PAM yields the same interpretation as in 6.1.2. (Mahalanobis with 2 clusters):
adjustedRandIndex(km2.mahalanobis$cluster, data_pam2$clustering)
## [1] 0.8301378
Above average universities vs Below average public universities vs Below average private universities.
data_pam3 = eclust(X_data, "pam", stand=TRUE, k=3, graph=F)
fviz_cluster(data_pam3, data = X_data, geom = c("point"), pointsize=1, main = "PAM clusplot with 3 clusters")+
theme_minimal()+geom_text(label=college.names,hjust=0, vjust=0,size=2,check_overlap = T)+scale_fill_brewer(palette="Greens")
With 3 clusters we also get the same interpretation as in 6.1.1. (K-Means with 3 clusters):
adjustedRandIndex(km3$cluster, data_pam3$clustering)
## [1] 0.6533104
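One advantage of PAM over K-Means is that each cluster center is an actual observation (the medoid), so we can name the college that represents each cluster (a sketch assuming the `data_pam3` and `college.names` objects above):

```r
# The three medoid colleges, one per cluster
college.names[data_pam3$id.med]
```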
Kernel K-Means is a tool that allows us to capture clusters that are not linearly separable in the input space.
library(kernlab)
We can try different kernels to see which one fits our data best:
With two clusters and a Gaussian kernel, and taking into account the plot at the beginning of section 6, the interpretation could be:
More known universities vs Less known universities.
# model:
ker = "rbfdot"
data_ker = kkmeans(as.matrix(X_data), centers=2, kernel=ker)
## Using automatic sigma estimation (sigest) for RBF or laplace kernel
# plot:
obj.data_ker21 = list(data = X_data, cluster = data_ker@.Data)
fviz_cluster(obj.data_ker21, geom = c("point"), ellipse=F,pointsize=1)+
theme_minimal()+geom_text(label=college.names,hjust=0, vjust=0,size=2,check_overlap = T)+scale_fill_brewer(palette="Greens")
Above average universities vs Below average universities.
# model:
ker = "polydot"
data_ker = kkmeans(as.matrix(X_data), centers=2, kernel=ker)
## Setting default kernel parameters
# plot:
obj.data_ker22 = list(data = X_data, cluster = data_ker@.Data)
fviz_cluster(obj.data_ker22, geom = c("point"), ellipse=F,pointsize=1)+
theme_minimal()+geom_text(label=college.names,hjust=0, vjust=0,size=2,check_overlap = T)+scale_fill_brewer(palette="Greens")
With two clusters and a polynomial kernel we see an interpretation similar to the one in 6.1.1. (K-Means with two clusters).
adjustedRandIndex(km2$cluster, obj.data_ker22$cluster)
## [1] 0.9451335
With three clusters and a Gaussian kernel, and taking into account the plot at the beginning of section 6, the interpretation could be:
More known universities vs Moderately known universities vs Less known universities.
# model:
ker = "rbfdot"
data_ker = kkmeans(as.matrix(X_data), centers=3, kernel=ker)
## Using automatic sigma estimation (sigest) for RBF or laplace kernel
# plot:
obj.data_ker31 = list(data = X_data, cluster = data_ker@.Data)
fviz_cluster(obj.data_ker31, geom = c("point"), ellipse=F,pointsize=1)+
theme_minimal()+geom_text(label=college.names,hjust=0, vjust=0,size=2,check_overlap = T)+scale_fill_brewer(palette="Greens")
Above average universities vs Below average public universities vs Below average private universities.
# model:
ker = "laplacedot"
data_ker = kkmeans(as.matrix(X_data), centers=3, kernel=ker)
## Using automatic sigma estimation (sigest) for RBF or laplace kernel
# plot:
obj.data_ker32 = list(data = X_data, cluster = data_ker@.Data)
fviz_cluster(obj.data_ker32, geom = c("point"), ellipse=F,pointsize=1)+
theme_minimal()+geom_text(label=college.names,hjust=0, vjust=0,size=2,check_overlap = T)+scale_fill_brewer(palette="Greens")
With three clusters and a Laplacian kernel we see that the interpretation is similar to the one in 6.1.1. (K-Means with three clusters).
adjustedRandIndex(km3$cluster, obj.data_ker32$cluster)
## [1] 0.5760837
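To summarize, we can compare all the three-cluster solutions obtained so far against plain K-Means in one call (a sketch assuming the objects fitted above and the mclust library loaded):

```r
# Adjusted Rand index of each 3-cluster solution vs. K-Means (k = 3)
sols = list(mahalanobis = km3.mahalanobis$cluster,
            pam         = data_pam3$clustering,
            kernel      = obj.data_ker32$cluster)
sapply(sols, adjustedRandIndex, y = km3$cluster)
```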
Once we have computed the models considering the Public/Private variable, we can do the same without it:
Because clustering relies on distances, it is advisable to scale the data first.
X_out_data = scale(out_data)
It is also important to determine the optimal number of clusters. We make some plots using the silhouette and WSS methods; we see that we could choose 2 (gap_stat is not considered because it is more time-consuming and the results of the previous methods are acceptable).
fviz_nbclust(X_out_data, kmeans, method = "silhouette", nstart = 1000, linecolor = "#9E1030FF")
fviz_nbclust(X_out_data, kmeans, method = "wss", nstart = 1000, linecolor = "#9E1030FF")
Before starting with the models, let's look at the PCA plot to see how the observations will be placed and to interpret the possible clusters through the loadings:
library(ggfortify)
pca_plot = autoplot(pca_out, data = out_data,
label = TRUE, label.size = 3, shape = FALSE,
loadings = TRUE, loadings.colour = 'red',
loadings.label = TRUE, loadings.label.size = 5)
library(plotly)
plotly::ggplotly(pca_plot)
Above average universities vs Below average universities.
# K=2:
k = 2
km2_out = kmeans(scale(out_data), center = k, nstart = 1000)
fviz_cluster(km2_out, data = scale(out_data), geom = c("point"),ellipse.type = 'norm', pointsize=1, main = "K-Means clusplot with 2 clusters")+
theme_minimal()+geom_text(label=row.names(out_data),hjust=0, vjust=0,size=2,check_overlap = T)+scale_fill_brewer(palette="Purples")
We can see two well differentiated groups. According to the interpretation of the PCs in 4.1., the two clusters are differentiated mainly by the first component: further to the left indicates a college better than the average, and further to the right the opposite.
Let's see how good the groups are:
As we can see in both plots, the groups can be considered good in terms of their silhouette widths. On the other hand, they are not that well balanced, but this seems acceptable.
barplot(table(km2_out$cluster), col="#9E1030FF", main = "Number of colleges per cluster", sub = " Hypothesis: 1 Above Av. / 2 Below Av.")
d = dist(out_data, method="euclidean")
out_data_sil = silhouette(km2_out$cluster, d)
plot(out_data_sil, col=1:2, main="", border=NA)
summary(out_data_sil)
## Silhouette of 938 units in 2 clusters from silhouette.default(x = km2_out$cluster, dist = d) :
## Cluster sizes and average silhouette widths:
## 303 635
## 0.1618806 0.3719111
## Individual silhouette widths:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.5005 0.2018 0.3716 0.3041 0.4811 0.5526
Now we can change one input of the K-Means model: the distance between observations. In this case we use the Mahalanobis distance. R has no specific library or function to apply it directly, so we construct it step by step:
S_x_out = cov(out_data) # covariance matrix
inv_Sx_out = solve(S_x_out) # S_x^-1 (inverse of S_x)
eigen_Sx_out = eigen(inv_Sx_out)
vector_Sx_out <- eigen_Sx_out$vectors
Y_out = vector_Sx_out %*% diag(sqrt(eigen_Sx_out$values)) %*% t(vector_Sx_out) # inv_Sx_out^1/2
X_til_out = scale(out_data, scale = FALSE)
out_data_Sx = X_til_out %*% Y_out
More popular universities vs Less popular universities
Let's consider two clusters first:
k = 2
km2_out.mahalanobis = kmeans(out_data_Sx, centers=k, nstart=1000)
In the clusplot we see that the interpretation is not straightforward because the groups overlap. This model does not seem to work well for our purposes. We can guess that the meaning is more popular vs. less popular universities.
fviz_cluster(km2_out.mahalanobis, out_data, geom = c("point"),ellipse.type = 'norm', pointsize=1, main = "K-Means with Mahalanobis distance clusplot with 2 clusters")+
theme_minimal()+geom_text(label=college.names,hjust=0, vjust=0,size=2,check_overlap = T)+scale_fill_brewer(palette="Purples")
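We can probe that guess by comparing enrollment across the two clusters, with Total_Under as the natural proxy for popularity (a sketch assuming `out_data` is a data frame with a Total_Under column and that `km2_out.mahalanobis` was fitted above):

```r
# Undergraduate enrollment by cluster: the "popular" cluster should be larger
boxplot(out_data$Total_Under ~ km2_out.mahalanobis$cluster,
        col = "#9E1030FF", xlab = "Cluster", ylab = "Total_Under")
```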
Another clustering tool, different from the ones we have already tested, is Hierarchical Clustering. Here the focus is to impose a hierarchy on the observations, represented in a dendrogram: close observations are merged step by step until all observations form a single group. A downside of this tool is that large data sets are difficult to manage, both in running time and in interpreting the results. For this reason:
The code is commented out to avoid long run times and visualization problems.
Note, however, that we considered different distances and linkage methods, and that the optimal number of clusters is k = 2. Phylogenetic trees are not shown either, for the same reason as the dendrograms.
# dis = "euclidean"
# met = "complete"
# d = dist(out_data, method = dis)
# hc = hclust(d, method = met)
# fviz_dend(hc, k = 2, cex = 0.6, rect = T, repel = T)
# dis = "manhattan"
# met = "single"
# d = dist(out_data, method = dis)
# hc = hclust(d, method = met)
# fviz_dend(hc, k = 2, cex = 0.6, rect = T, repel = T)
# dis = "canberra"
# met = "average"
# d = dist(out_data, method = dis)
# hc = hclust(d, method = met)
# fviz_dend(hc, k = 2, cex = 0.6, rect = T, repel = T)
As before, we determine the optimal number of clusters using the silhouette and WSS methods. We see that we could choose 2 (gap_stat is not considered because it is more time-consuming and the results of the previous methods are acceptable).
fviz_nbclust(X_out_data, kmeans, method = "silhouette", nstart = 1000, linecolor = "#9E1030FF")
fviz_nbclust(X_out_data, kmeans, method = "wss", nstart = 1000, linecolor = "#9E1030FF")
Above average universities vs Below average universities
out_data_pam2 = eclust(X_out_data, "pam", stand=TRUE, k=2, graph=F)
fviz_cluster(out_data_pam2, data = X_out_data, geom = c("point"), pointsize=1, main = "PAM clusplot with 2 clusters")+
theme_minimal()+geom_text(label=college.names,hjust=0, vjust=0,size=2,check_overlap = T)+scale_fill_brewer(palette="Purples")
Kernel K-Means is a tool that allows us to capture clusters that are not linearly separable in the input space.
library(kernlab)
We can try different kernels to see which one fits our data best:
With two clusters and a Gaussian kernel, and taking into account the plot at the beginning of section 6, the interpretation could be:
More known universities vs Less known universities.
# model:
ker = "rbfdot"
out_data_ker = kkmeans(as.matrix(X_out_data), centers=2, kernel=ker)
## Using automatic sigma estimation (sigest) for RBF or laplace kernel
# plot:
obj.out_data_ker21 = list(data = X_out_data, cluster = out_data_ker@.Data)
fviz_cluster(obj.out_data_ker21, geom = c("point"), ellipse=F,pointsize=1)+
theme_minimal()+geom_text(label=college.names,hjust=0, vjust=0,size=2,check_overlap = T)+scale_fill_brewer(palette="Purples")
Above average universities vs Below average universities.
# model:
ker = "polydot"
out_data_ker = kkmeans(as.matrix(X_out_data), centers=2, kernel=ker)
## Setting default kernel parameters
# plot:
obj.out_data_ker22 = list(data = X_out_data, cluster = out_data_ker@.Data)
fviz_cluster(obj.out_data_ker22, geom = c("point"), ellipse=F,pointsize=1)+
theme_minimal()+geom_text(label=college.names,hjust=0, vjust=0,size=2,check_overlap = T)+scale_fill_brewer(palette="Purples")
Computing the adjusted Rand index gives us a way to see whether the models built on the two datasets (with and without Public/Private) are similar. We can only compare models with 2 clusters. The values range from near 0 (PAM, Kernel) to about 50% (Mahalanobis), which tells us that adding the Public/Private variable is decisive for the outcome. For K-Means, however, the value is 63%, because its interpretation was Above/Below average, which has little to do with whether a college is private or public.
adjustedRandIndex(km2$cluster, km2_out$cluster)
## [1] 0.6334568
K-Means 2 clusters
adjustedRandIndex(km2.mahalanobis$cluster, km2_out.mahalanobis$cluster)
## [1] 0.4952633
Mahalanobis 2 clusters
adjustedRandIndex(data_pam2$cluster, out_data_pam2$cluster)
## [1] 0.08283191
PAM 2 clusters
adjustedRandIndex(obj.data_ker21$cluster, obj.out_data_ker22$cluster)
## [1] 0.001535154
Kernel 2 clusters
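As a sanity check of what `adjustedRandIndex` measures, here is a toy example with invented labels: the index is 1 for identical partitions, even when the cluster labels are permuted, which is exactly why it is suitable for comparing clusterings fitted independently on the two datasets.

```r
library(mclust)  # provides adjustedRandIndex()
# Two identical partitions of six points, with the cluster labels permuted:
a = c(1, 1, 2, 2, 3, 3)
b = c(2, 2, 1, 1, 3, 3)
adjustedRandIndex(a, b)  # 1: ARI ignores the labelling, only the grouping matters
```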
After computing the different models for each dataset (with and without Public/Private) and comparing them, we can select the ones that best fit our intention. In this case we choose two models, one with two clusters and one with three. With two clusters the interesting separation is Public/Private, and with three clusters it is the difference according to a quality indicator (above/below average).
2 clusters: Private universities vs Public universities
k = 2
km2.mahalanobis = kmeans(data_Sx, centers=k, nstart=1000)
fviz_cluster(km2.mahalanobis, data, geom = c("point"),ellipse.type = 'norm', pointsize=1, main = "K-Means with Mahalanobis distance clusplot with 2 clusters")+
theme_minimal()+geom_text(label=college.names,hjust=0, vjust=0,size=2,check_overlap = T)+scale_fill_brewer(palette="Greens")
3 clusters: Above average universities vs Below average public universities vs Below average private universities.
data_pam3 = eclust(X_data, "pam", stand=TRUE, k=3, graph=F)
fviz_cluster(data_pam3, data = X_data, geom = c("point"), pointsize=1, main = "PAM clusplot with 3 clusters")+
theme_minimal()+geom_text(label=college.names,hjust=0, vjust=0,size=2,check_overlap = T)+scale_fill_brewer(palette="Greens")
df = data.frame(cluster=data_pam3$cluster, state=college.states) %>% group_by(state) %>%
summarise(X = mean(cluster))
plot_usmap(data=df, values = "X") + scale_fill_continuous(
low = "white", high = "red", name = "Mean cluster")
After selecting the best models, we can use the profiles defined in Section 2.5 to answer some natural questions:
Are the better universities the ones with a higher TotalCost?
The natural answer to this question would be affirmative, and in fact, it is:
p1 = data.frame(z1=pca$x[,1],z2=pca$x[,2]) %>%
ggplot(aes(z1,z2,label=college.names,color=data$TotalCost)) + geom_point(size=0) +
labs(title="Universities map according to TotalCost", x="PC1", y="PC2") +
theme_bw() + scale_color_gradient(low="lightblue", high="darkblue")+theme(legend.position="bottom") + geom_text(size=2, hjust=0.6, vjust=0, check_overlap = TRUE)
p2 = fviz_cluster(data_pam3, data = X_data, geom = c("point"), pointsize=1, main = "PAM clusplot with 3 clusters")+
theme_minimal()+geom_text(label=college.names,hjust=0, vjust=0,size=2,check_overlap = T)+scale_fill_brewer(palette="Greens")
ggarrange(p1, p2, ncol = 1, nrow = 2)
Are the better universities private? Comparing the plot of the selected K-Means model with the distribution of colleges according to Public/Private, we confirm that our initial guess is true:
p1 = data.frame(z1=pca$x[,1],z2=pca$x[,2]) %>%
ggplot(aes(z1,z2,label=college.names,color=college.pub_priv)) + geom_point(size=0) +
labs(title="Universities map according to Public/Private", x="PC1", y="PC2") +
theme_bw() + scale_color_gradient(low="lightblue", high="darkblue")+theme(legend.position="bottom") + geom_text(size=2, hjust=0.6, vjust=0, check_overlap = TRUE)
p2 = fviz_cluster(km2.mahalanobis, data, geom = c("point"),ellipse.type = 'norm', pointsize=1, main = "K-Means with Mahalanobis distance clusplot with 2 clusters")+
theme_minimal()+geom_text(label=college.names,hjust=0, vjust=0,size=2,check_overlap = T)+scale_fill_brewer(palette="Greens")
ggarrange(p1, p2, ncol = 1, nrow = 2)
Are the better universities the ones with a low acceptance rate?
We can compare the plot of the clustering model we selected with a plot of the distribution of colleges according to Acc_Rate. We see that, indeed, the acceptance rate, or in other words exclusivity, affects how good a college is:
p1 = data.frame(z1=pca$x[,1],z2=pca$x[,2]) %>%
ggplot(aes(z1,z2,label=college.names,color=data$Acc_Rate)) + geom_point(size=0) +
labs(title="Universities map according to Acc_Rate", x="PC1", y="PC2") +
theme_bw() + scale_color_gradient(low="yellow", high="darkblue")+theme(legend.position="bottom") + geom_text(size=2, hjust=0.6, vjust=0, check_overlap = TRUE)
p2 = fviz_cluster(data_pam3, data = X_data, geom = c("point"), pointsize=1, main = "PAM clusplot with 3 clusters")+
theme_minimal()+geom_text(label=college.names,hjust=0, vjust=0,size=2,check_overlap = T)+scale_fill_brewer(palette="Greens")
ggarrange(p1, p2, ncol = 1, nrow = 2)
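For reference, the Acc_Rate profile used above is presumably just accepted applicants over applications received, using the Apps and Accepted columns from the variable table. A minimal sketch with invented numbers:

```r
# Toy acceptance-rate computation (the figures below are invented):
Apps     = c(12000, 3000, 800)   # applications received
Accepted = c( 1500, 2400, 700)   # applicants accepted
Acc_Rate = Accepted / Apps
Acc_Rate  # lower values mean a more exclusive college: 0.125 0.800 0.875
```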
Does the region affect the position of universities?
In general we cannot affirm that, because there is no clear pattern. However, we can say that better and private universities are more likely to be in the NorthEast region. This is natural once we take into account that this region is home to the Ivy League (https://www.usnews.com/education/best-colleges/ivy-league-schools).
p1 = data.frame(z1=pca$x[,1],z2=pca$x[,2]) %>%
ggplot(aes(z1,z2,label=college.names,color=college.elite)) + geom_point(size=0) +
labs(title="PCA", x="PC1", y="PC2") +
theme_bw() + geom_text(size=2, hjust=0.6, vjust=0, check_overlap = TRUE)
p2 = data.frame(z1=pca$x[,1],z2=pca$x[,2]) %>%
ggplot(aes(z1,z2,label=college.states,color=college.regions)) + geom_point(size=0) +
labs(title="Universities map according to Region", x="PC1", y="PC2") +
theme_bw() + theme(legend.position="bottom") + geom_text(size=2, hjust=0.6, vjust=0, check_overlap = TRUE)
ggarrange(p1, p2, ncol = 1, nrow = 2)
We can also confirm this by looking at the US maps for TotalCost and Acc_Rate. We see that universities in the NorthEast region (Ivy League) and in the West (Caltech, Stanford, …) have a higher cost:
df = data.frame(cost=data$TotalCost, state=college.states) %>% group_by(state) %>%
summarise(X = median(cost))
p1 = plot_usmap(data=df, values = "X") + scale_fill_continuous(
low = "white", high = "#CB454A", name = "Total Cost") + labs(title = "Distribution of universities in the US by Total Cost")
df = data.frame(acc=data$Acc_Rate, state=college.states) %>% group_by(state) %>%
summarise(X = mean(acc))
p2 = plot_usmap(data=df, values = "X") + scale_fill_continuous(
low = "#CB454A", high = "white", name = "Acc Rate") + labs(title = "Distribution of universities in the US by Acceptance Rate")
ggarrange(p1, p2, ncol = 2, nrow = 1)
It is more or less clear that better universities are located mainly in the NorthEast and West regions, but let's confirm it with the previously defined Elite universities profile:
df = data.frame(acc=college.elite, state=college.states) %>% group_by(state) %>%
summarise(X = sum(acc))
plot_usmap(data=df, values = "X") + scale_fill_continuous(
low = "white", high = "#CB454A", name = "Elite universities") + labs(title = "Distribution of Elite universities in the US")
We can conclude that, indeed, the better universities are in those regions.